One of the most interesting unsupervised machine learning techniques is clustering, which helps us find meaningful ways to group items based on how similar they are. One useful application is in higher education. For example, consider two schools in my home state of Ohio: should we think of the College of Wooster as a comparable school to the University of Dayton? Both are private colleges that enroll fewer than 10,000 students, draw a healthy share of out-of-state students, and charge relatively high (>$40,000) tuition. When we classify universities, should they be in the same group?
For this analysis, I’ll attempt to classify schools based on their selectivity. ‘Selectivity’ is a fuzzy concept, but a handful of statistics serve as reasonable proxies for it, reflecting a university’s prestige and academic rigor: tuition rates, admission rates, ACT scores of incoming students, and total enrollment.
We will make use of the IPEDS (Integrated Postsecondary Education Data System) database, a government source for higher education data. The data used here come from the 2021 IPEDS collection.
IPEDS publishes a Microsoft Access database every year with hundreds of variables covering thousands of higher education institutions. To simplify things, I exported the relevant tables from Access to Excel, which streamlines the analysis and avoids working with the bulky Access database.
We need the tidyverse as well as the readxl package to load and manipulate the Excel data.
#Load packages
library(tidyverse)
library(readxl)

# Read data
institution <- read_excel("data\\HD2021.xlsx")
adm <- read_excel("data\\ADM2021.xlsx")
adm_yield <- read_excel("data\\DRVADM2021.xlsx")
enrollment <- read_excel("data\\DRVEF2021.xlsx")
enrollment_d <- read_excel("data\\EF2021D.xlsx")
tuition <- read_excel("data\\IC2021_AY.xlsx")
We will require data from several IPEDS tables. IPEDS stores each variable under a mnemonic, so I translate each mnemonic into a more informative variable name as the tables are loaded.
#Institutional characteristics
institution <- institution %>%
select(UNITID, INSTNM, ICLEVEL, CONTROL, LATITUDE, LONGITUD, ADDR, CITY, STABBR, ZIP, CBSA) %>%
rename(school = INSTNM)
#Admissions
adm <- adm %>%
select(UNITID, ACTCM75) %>%
rename(act_score = ACTCM75)
#Derived admissions table
adm_yield <- adm_yield %>%
select(UNITID, DVADM01) %>%
rename(pct_admitted = DVADM01)
#Enrollment characteristics
enrollment <- enrollment %>%
select(UNITID, EFUG, EFUGFT, EFUGPT, PCUDEEXC, RMINSTTP, RMOUSTTP, RMFRGNCP) %>%
rename(ug_enroll = EFUG,
ft_enroll = EFUGFT,
pt_enroll = EFUGPT,
pct_online = PCUDEEXC,
pct_instate = RMINSTTP,
pct_outstate = RMOUSTTP,
pct_foreign = RMFRGNCP)
enrollment_d <- enrollment_d %>%
select(UNITID, RET_PCF) %>%
rename(retention_rate = RET_PCF)
enrollment <- left_join(enrollment, enrollment_d, by = c("UNITID" = "UNITID"))
#Tuition
tuition <- tuition %>%
select(UNITID, TUITION2, TUITION3) %>%
rename(in_state_tuition = TUITION2,
out_state_tuition = TUITION3)
Now we can join all this data together into a main data frame and filter it to include only 4-year, non-profit schools (i.e., exclude 2-year/technical colleges and for-profit institutions, which are outside the scope of this analysis).
###Join it all together###
school_data <- institution %>%
left_join(y = adm, by = c("UNITID" = "UNITID")) %>%
left_join(y = adm_yield, by = c("UNITID" = "UNITID")) %>%
left_join(y = enrollment, by = c("UNITID" = "UNITID")) %>%
left_join(y = tuition, by = c("UNITID" = "UNITID"))
#And limit the analysis to 4 year private & public non-profit institutions
school_data <- school_data %>%
filter(ICLEVEL == 1,
CONTROL != 3) %>%
select(-ICLEVEL)
#Let's create a key linking each school ID to a school, for later
school_id_key <- school_data %>%
select(UNITID, school)
We now have a primary data frame. Let’s take a look at it.
school_data <- school_data %>%
mutate(UNITID = as.character(UNITID))
summary(school_data)
## UNITID school CONTROL LATITUDE
## Length:2487 Length:2487 Min. :1.000 Min. :-14.32
## Class :character Class :character 1st Qu.:1.000 1st Qu.: 34.43
## Mode :character Mode :character Median :2.000 Median : 39.60
## Mean :1.667 Mean : 37.97
## 3rd Qu.:2.000 3rd Qu.: 41.78
## Max. :2.000 Max. : 71.32
##
## LONGITUD ADDR CITY STABBR
## Min. :-170.74 Length:2487 Length:2487 Length:2487
## 1st Qu.: -96.59 Class :character Class :character Class :character
## Median : -84.88 Mode :character Mode :character Mode :character
## Mean : -88.82
## 3rd Qu.: -76.70
## Max. : 171.38
##
## ZIP CBSA act_score pct_admitted
## Length:2487 Min. : -2 Min. :15.00 Min. : 0.00
## Class :character 1st Qu.:18520 1st Qu.:24.00 1st Qu.: 63.00
## Mode :character Median :31080 Median :27.00 Median : 78.00
## Mean :29096 Mean :27.01 Mean : 72.81
## 3rd Qu.:38980 3rd Qu.:30.00 3rd Qu.: 89.00
## Max. :49780 Max. :36.00 Max. :100.00
## NA's :1501 NA's :835
## ug_enroll ft_enroll pt_enroll pct_online
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.00
## 1st Qu.: 331 1st Qu.: 216 1st Qu.: 20 1st Qu.: 0.00
## Median : 1488 Median : 1132 Median : 155 Median : 7.00
## Mean : 4302 Mean : 3163 Mean : 1139 Mean : 17.16
## 3rd Qu.: 4219 3rd Qu.: 2966 3rd Qu.: 826 3rd Qu.: 23.00
## Max. :121884 Max. :107952 Max. :85914 Max. :100.00
## NA's :82 NA's :82 NA's :82 NA's :329
## pct_instate pct_outstate pct_foreign retention_rate
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 52.00 1st Qu.: 7.00 1st Qu.: 0.000 1st Qu.: 65.00
## Median : 77.00 Median : 20.00 Median : 1.000 Median : 74.00
## Mean : 69.46 Mean : 26.73 Mean : 3.036 Mean : 73.18
## 3rd Qu.: 91.00 3rd Qu.: 42.00 3rd Qu.: 4.000 3rd Qu.: 83.00
## Max. :100.00 Max. :100.00 Max. :67.000 Max. :100.00
## NA's :1317 NA's :1317 NA's :1317 NA's :612
## in_state_tuition out_state_tuition
## Min. : 0 Min. : 0
## 1st Qu.: 7070 1st Qu.:12000
## Median :13603 Median :20304
## Mean :20566 Mean :24077
## 3rd Qu.:32800 3rd Qu.:33681
## Max. :77200 Max. :77200
## NA's :342 NA's :342
Now we can engage in some light feature engineering. One useful new feature is an ‘average’ tuition rate for all students, weighting in-state and out-of-state tuition by the share of students in each category.
school_data <- school_data %>%
mutate(in_state_share = pct_instate/ (pct_instate + pct_outstate),
out_state_share = 1-in_state_share,
avg_tuition = (in_state_share * in_state_tuition) + (out_state_share * out_state_tuition)
) %>%
select(-in_state_share, -out_state_share)
summary(school_data$avg_tuition)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 7430 14700 21028 33231 62420 1323
One thing I noticed in the first exploration is that the ACT score and percent admitted fields have the most NAs. We can use the plot_histogram() function from the DataExplorer package to get a better sense of the missing data here.
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.2.3
There doesn’t seem to be a unifying feature linking the schools with missing ACT scores. Most are smaller schools, but some larger schools are missing scores as well. Accordingly, we can impute the missing values with the mean ACT score in our data, so we aren’t losing all of these observations.
school_data <- school_data %>%
mutate(act_score = ifelse(is.na(act_score), mean(act_score, na.rm = TRUE), act_score))
Finally, k-means can only handle numeric data and complete cases, so we will drop the remaining rows with missing values and remove the categorical features.
school_full <- school_data
school_data <- school_data %>%
filter(!is.na(act_score)) %>%
drop_na()
modeled_data <- school_data %>%
column_to_rownames(var = 'UNITID') %>%
mutate(pct_admitted = ifelse(is.na(pct_admitted), mean(pct_admitted, na.rm = TRUE), pct_admitted)) %>% #Impute any remaining missing admission rates with the mean
select(-ft_enroll, -pt_enroll, -pct_online, -pct_foreign, -retention_rate, -in_state_tuition, -out_state_tuition, -school,
-CITY, -STABBR, -ZIP, -CBSA, -LATITUDE, -LONGITUD, -ADDR, -CONTROL) #These variables can explain the results of the analysis but themselves don't define 'selectivity' in our analysis
plot_histogram(modeled_data)
summary(modeled_data)
## act_score pct_admitted ug_enroll pct_instate
## Min. :15.00 Min. : 4.00 Min. : 14 Min. : 0.00
## 1st Qu.:26.00 1st Qu.: 66.00 1st Qu.: 1474 1st Qu.: 49.75
## Median :27.01 Median : 79.00 Median : 2894 Median : 73.00
## Mean :27.14 Mean : 73.76 Mean : 6732 Mean : 66.54
## 3rd Qu.:29.00 3rd Qu.: 89.00 3rd Qu.: 8013 3rd Qu.: 88.00
## Max. :36.00 Max. :100.00 Max. :121884 Max. :100.00
## pct_outstate avg_tuition
## Min. : 0.00 Min. : 1156
## 1st Qu.: 10.00 1st Qu.: 9395
## Median : 24.00 Median :19129
## Mean : 29.43 Mean :24253
## 3rd Qu.: 45.00 3rd Qu.:36263
## Max. :100.00 Max. :62420
Now that we’ve cleaned up the data, we can look at which fields will go into our k-means analysis.
colnames(modeled_data)
## [1] "act_score" "pct_admitted" "ug_enroll" "pct_instate" "pct_outstate"
## [6] "avg_tuition"
We have a nice, tidy dataset.
Now we can start analyzing our data! K-means requires that all features be on the same scale, so that fields with large values or wide ranges don’t have an outsized influence on the distance calculations. We can do that easily with the scale() function.
modeled_data <- modeled_data %>%
scale()
summary(modeled_data)
## act_score pct_admitted ug_enroll pct_instate
## Min. :-3.70573 Min. :-3.3577 Min. :-0.6972 Min. :-2.6124
## 1st Qu.:-0.34814 1st Qu.:-0.3734 1st Qu.:-0.5457 1st Qu.:-0.6591
## Median :-0.03981 Median : 0.2523 Median :-0.3983 Median : 0.2538
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.56756 3rd Qu.: 0.7336 3rd Qu.: 0.1329 3rd Qu.: 0.8428
## Max. : 2.70421 Max. : 1.2631 Max. :11.9499 Max. : 1.3139
## pct_outstate avg_tuition
## Min. :-1.2651 Min. :-1.4043
## 1st Qu.:-0.8352 1st Qu.:-0.9033
## Median :-0.2333 Median :-0.3115
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6695 3rd Qu.: 0.7302
## Max. : 3.0339 Max. : 2.3205
While the ‘optimal’ number of clusters is ultimately a judgment call, there are three quantitative measures that can guide the choice:

1. The ‘elbow’ method calculates the total within-cluster sum of squares for various values of k; the ‘elbow point’ is the suggested k, since beyond it increasing k no longer yields a significant improvement.
2. The silhouette method measures how closely each item matches its own cluster and how loosely it matches neighboring clusters; a high average silhouette width is optimal.
3. The gap statistic compares the within-cluster dispersion of the observed data against that of randomly generated reference datasets.

These three methods are ultimately suggestions, and it’s up to the analyst to choose the final k.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.3
###Choose the right k
#Elbow method
fviz_nbclust(modeled_data, kmeans, method = "wss") #This suggests that 2 or 4 is the optimal k
#Average silhouette method
fviz_nbclust(modeled_data, kmeans, method = "silhouette") #This suggests optimal k is 2
fviz_nbclust(modeled_data, kmeans, method = "gap_stat") #This suggests 10 is the optimal k
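As a sanity check on the elbow plot, the quantity fviz_nbclust() charts for method = "wss" is just the total within-cluster sum of squares that kmeans() itself reports. Here's a minimal sketch computing it by hand on simulated stand-in data (not our school data):

```r
# Total within-cluster sum of squares for k = 1..10, pulled directly from kmeans()
set.seed(1234)
toy <- matrix(rnorm(200), ncol = 2) # 100 simulated points in 2 dimensions
wss <- sapply(1:10, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
wss # the 'elbow' is wherever these values stop dropping sharply
```

Plotting k against wss reproduces the elbow chart; the factoextra functions simply wrap this loop with nicer graphics.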
Two of the three methods point to 2 clusters. While there is no single right answer here, intuitively it makes sense to have somewhere between 3 and 6 groups for easy interpretation.
set.seed(1234)
k_2 <- kmeans(modeled_data, centers = 2, nstart = 50)
k_4 <- kmeans(modeled_data, centers = 4, nstart = 50)
k_2$size
## [1] 631 293
k_4$size
## [1] 93 435 319 77
#Assign each school to a cluster
school_data <- school_data %>%
mutate(cluster_2 = k_2$cluster) %>%
mutate(cluster_4 = k_4$cluster)
We can start by simply summarizing each of the resulting groups under both clusterings.
school_data %>%
select(cluster_2, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
group_by(cluster_2) %>%
summarize_all('mean')
## # A tibble: 2 × 8
## cluster_2 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.39 26.2 79.7 7399. 13.5 80.7
## 2 2 1.89 29.1 61.0 5297. 7.78 35.9
## # ℹ 1 more variable: avg_tuition <dbl>
school_data %>%
select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
group_by(cluster_4) %>%
summarize_all('mean')
## # A tibble: 4 × 8
## cluster_4 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.96 32.4 34.9 5452. 2.46 24.2
## 2 2 1.30 25.7 81.4 5569. 15.6 85.5
## 3 3 1.89 27.1 76.0 2728. 8.45 51.3
## 4 4 1.03 29.1 68.4 31441. 13.8 73.9
## # ℹ 1 more variable: avg_tuition <dbl>
We can also visualize the results. Note that this chart reduces the data to two dimensions via Principal Component Analysis, which makes the axes harder to interpret, but it nonetheless gives a good sense of the rough shapes and overlaps within our clusters.
fviz_cluster(k_2,
data = modeled_data,
repel = TRUE,
geom = "point",
ggtheme = theme_minimal(),
title = "Clustering Results - 2 Clusters")
fviz_cluster(k_4,
data = modeled_data,
repel = TRUE,
geom = "point",
ggtheme = theme_minimal(),
title = "Clustering Results - 4 Clusters")
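To gauge how much information the two-dimensional view retains, we can check the share of variance explained by the first two principal components. A quick sketch on a simulated stand-in matrix (with the real data, you would pass the scaled modeled_data matrix to prcomp() instead):

```r
# Cumulative share of total variance captured by the first two principal components
set.seed(42)
stand_in <- scale(matrix(rnorm(300 * 6), ncol = 6)) # stand-in for modeled_data
pca <- prcomp(stand_in)
summary(pca)$importance["Cumulative Proportion", 2] # variance kept by PC1 + PC2
```

The lower this proportion, the more cautious we should be about reading cluster shapes off the two-dimensional chart.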
To me, two groups doesn’t seem descriptive enough, so sticking with 4 groups feels right.
Let’s see the summary table on the four-group split again.
library(kableExtra)
school_data %>%
select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
group_by(cluster_4) %>%
summarize_all('mean') %>%
kable(digits=2, format = 'markdown',row.names = TRUE) %>%
kable_styling(full_width = T,
font_size = 12) %>%
row_spec(row = 0, color = '#660033')
| | cluster_4 | CONTROL | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.96 | 32.36 | 34.90 | 5451.74 | 2.46 | 24.15 | 52479.63 |
| 2 | 2 | 1.30 | 25.71 | 81.37 | 5569.05 | 15.64 | 85.48 | 13954.39 |
| 3 | 3 | 1.89 | 27.10 | 76.01 | 2728.06 | 8.45 | 51.29 | 32747.36 |
| 4 | 4 | 1.03 | 29.09 | 68.35 | 31440.71 | 13.84 | 73.88 | 13146.53 |
It’s interesting that even though control (public vs. private) was not a feature included in the clustering, the four groups split relatively cleanly between public and private schools.
Now that we have a sense of what makes up our data, we can give our clusters some more informative names.
summary_table <- school_data %>%
select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition, retention_rate) %>%
group_by(cluster_4) %>%
summarize_all('mean')
summary_table <- summary_table %>%
mutate(cluster_4 = ifelse(cluster_4=='1', 'Elites',
ifelse(cluster_4=='2', 'Less Selective',
ifelse(cluster_4=='3', 'Small Privates', 'Large Publics')))) %>%
rename(cluster = cluster_4,
control = CONTROL)
kable(summary_table, 'html', digits = c(0,2,2,2,0,1,1,0,2), align = 'lcccccccc') %>%
kable_styling(full_width = F)
| cluster | control | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition | retention_rate |
|---|---|---|---|---|---|---|---|---|
| Elites | 1.96 | 32.36 | 34.90 | 5452 | 2.5 | 24.2 | 52480 | 90.65 |
| Less Selective | 1.30 | 25.71 | 81.37 | 5569 | 15.6 | 85.5 | 13954 | 70.41 |
| Small Privates | 1.89 | 27.10 | 76.01 | 2728 | 8.5 | 51.3 | 32747 | 73.92 |
| Large Publics | 1.03 | 29.09 | 68.35 | 31441 | 13.8 | 73.9 | 13147 | 86.64 |
Let’s drive this home with some visuals.
school_data <- school_data %>%
select(-cluster_2) %>%
rename(cluster= cluster_4) %>%
mutate(cluster = ifelse(cluster=='1', 'Elites',
ifelse(cluster=='2', 'Less Selective',
ifelse(cluster=='3', 'Small Privates', 'Large Publics')))) %>%
mutate(cluster = as.factor(cluster))
library(RColorBrewer) #for pretty colors
#Average selectivity metrics by cluster
school_data %>%
select(cluster, act_score, pct_admitted, avg_tuition, ug_enroll) %>%
group_by(cluster) %>%
summarize_all('mean') %>%
pivot_longer(cols = -cluster) %>%
ggplot() +
geom_col(aes(x = cluster, y = value, fill = cluster)) +
facet_wrap(~name, scales = 'free') +
scale_fill_brewer(palette = 'Dark2') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
#Scatterplot - tuition and act scores
# my_plot <- school_data %>%
# rename(undergrad_enrollment = ug_enroll) %>%
# ggplot() +
# geom_point(aes(x = act_score, y = avg_tuition, color = cluster, size = undergrad_enrollment), alpha = 0.5) +
# scale_color_brewer(palette = 'Dark2') +
# theme_minimal() +
# labs(x = 'ACT Score (Composite, 75th Percentile)',
# y = 'Average Tuition Rate',
# title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')
#ggplotly(my_plot) # Convert to Plotly object
library(plotly)
my_plot <- plot_ly(data = school_data,
x = ~act_score, y = ~avg_tuition, type = "scatter", mode = "markers",
color = ~cluster,
size = ~ug_enroll,
hoverinfo = "text", text = ~paste("School: ", school, "<br> Undergrad Enrollment: ", ug_enroll)) %>%
plotly::layout(title = "Selectivity Metrics for Schools Based on Cluster",
xaxis = list(title = "ACT Score (Composite, 75th Percentile)"),
yaxis = list(title = "Average Tuition Rate"))
my_plot
## Warning: `line.width` does not currently support multiple values.
So there you have it. Even though individual schools can have similar values on a given selectivity measure, k-means helps us tell whether one school really is similar to another, based not on one or two metrics but on the sum total of all the metrics we deem important.
A good extension of this analysis would be to use these groupings in a predictive model, building a separate model for each cluster of schools. This is potentially useful because each model’s coefficients then only need to describe a subset of the data rather than all of it, which could ultimately yield more accurate predictions.
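A minimal sketch of that per-cluster modeling idea, using base R and simulated data; the outcome and predictor here are made up for illustration, not taken from IPEDS:

```r
# Fit one linear model per cluster; each model's coefficients then only need
# to describe schools within that cluster. Data below are simulated.
set.seed(99)
toy <- data.frame(
  cluster = rep(c("Elites", "Less Selective"), each = 50),
  act_score = rnorm(100, mean = 27, sd = 2)
)
toy$retention <- 40 + 1.5 * toy$act_score + rnorm(100, sd = 3)

models <- lapply(split(toy, toy$cluster),
                 function(d) lm(retention ~ act_score, data = d))
sapply(models, function(m) coef(m)["act_score"]) # one slope per cluster
```

With real data, the cluster assignments from k-means would define the split, and each submodel would be tuned and evaluated on its own subset.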
And what about those schools I brought up in the beginning, College of Wooster and Dayton? Turns out they do belong to different groups in our analysis, with Dayton in the ‘Small Privates’ group and Wooster in the ‘Elites’ group.
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.2.3
school_data %>%
rename(undergrad_enrollment = ug_enroll) %>%
ggplot(aes(x = act_score, y = avg_tuition)) +
geom_point(aes(color = cluster, size = undergrad_enrollment), alpha = 0.5) +
geom_text_repel(aes(label=ifelse(school == 'The College of Wooster' | school == 'University of Dayton', school, '')), nudge_x = 2, nudge_y = -4500, max.overlaps = 2000) +
scale_color_brewer(palette = 'Dark2') +
theme_minimal() +
labs(x = 'ACT Score (Composite, 75th Percentile)',
y = 'Average Tuition Rate',
title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')